@prtkgaur prtkgaur commented Dec 5, 2025

Co-authored-by: [email protected]

Rationale for this change

ALP significantly improves the compression ratio and decompression speed of float/double columns compared with other encoding and compression techniques.

What changes are included in this PR?

This PR introduces ALP (pseudo-decimal) encoding into the Arrow C++ code.
Adding it required the following components:

  • Sampler
  • Encoder for ALP data
  • Decoder for ALP data
  • Benchmarks and a dataset to demonstrate the effectiveness of the algorithm.
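
For readers unfamiliar with ALP, here is a minimal sketch of the pseudo-decimal idea behind the sampler, encoder and decoder listed above. It is not the code from this PR: every name in it (AlpSketchParams, TryEncode, ChooseParams and so on) is hypothetical, it handles only double, it omits the bit-packing / frame-of-reference step applied to the integers, and it decodes with a division where a tuned implementation would more likely multiply by precomputed factors. A value is scaled by 10^e / 10^f, rounded to an integer, and kept only if decoding reproduces the original bits; anything else is stored verbatim as an exception.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <vector>

// Hypothetical parameter pair: scale up by 10^exponent, down by 10^factor.
struct AlpSketchParams {
  int exponent;
  int factor;
};

// Encoded form of one vector: integers plus verbatim exceptions.
struct AlpSketchVector {
  AlpSketchParams params;
  std::vector<int64_t> digits;
  std::vector<std::size_t> exception_positions;
  std::vector<double> exception_values;
};

// Powers of ten up to 10^22 are exactly representable as doubles.
inline double Pow10(int n) {
  double result = 1.0;
  for (int i = 0; i < n; ++i) result *= 10.0;
  return result;
}

inline bool BitEqual(double a, double b) {
  return std::memcmp(&a, &b, sizeof(double)) == 0;
}

inline double Decode(int64_t digits, AlpSketchParams p) {
  return static_cast<double>(digits) * Pow10(p.factor) / Pow10(p.exponent);
}

// Returns the integer form, or nullopt if the value does not round-trip
// bit-exactly (NaN, infinity, -0.0, too many digits, out of range).
inline std::optional<int64_t> TryEncode(double value, AlpSketchParams p) {
  const double scaled = value * Pow10(p.exponent) / Pow10(p.factor);
  if (!std::isfinite(scaled) || std::fabs(scaled) > 9.0e15) return std::nullopt;
  const int64_t digits = std::llround(scaled);
  if (!BitEqual(Decode(digits, p), value)) return std::nullopt;
  return digits;
}

inline AlpSketchVector EncodeVector(const std::vector<double>& values,
                                    AlpSketchParams params) {
  AlpSketchVector out{params, {}, {}, {}};
  out.digits.reserve(values.size());
  for (std::size_t i = 0; i < values.size(); ++i) {
    if (auto digits = TryEncode(values[i], params)) {
      out.digits.push_back(*digits);
    } else {
      out.digits.push_back(0);  // placeholder, patched on decode
      out.exception_positions.push_back(i);
      out.exception_values.push_back(values[i]);
    }
  }
  return out;
}

inline std::vector<double> DecodeVector(const AlpSketchVector& encoded) {
  std::vector<double> out;
  out.reserve(encoded.digits.size());
  for (int64_t digits : encoded.digits) {
    out.push_back(Decode(digits, encoded.params));
  }
  // Patch the values that could not be represented as pseudo-decimals.
  for (std::size_t k = 0; k < encoded.exception_positions.size(); ++k) {
    out[encoded.exception_positions[k]] = encoded.exception_values[k];
  }
  return out;
}

// Simplified stand-in for a sampler: brute-force the (exponent, factor) pair
// with the fewest exceptions on a sample. It ignores the resulting bit width.
inline AlpSketchParams ChooseParams(const std::vector<double>& sample) {
  AlpSketchParams best{0, 0};
  std::size_t best_exceptions = sample.size() + 1;
  for (int e = 0; e <= 18; ++e) {
    for (int f = 0; f <= e; ++f) {
      std::size_t exceptions = 0;
      for (double v : sample) {
        if (!TryEncode(v, {e, f})) ++exceptions;
      }
      if (exceptions < best_exceptions) {
        best_exceptions = exceptions;
        best = {e, f};
      }
    }
  }
  return best;
}
```

In this sketch, calling EncodeVector on {1.23, 4.5, 100.0} with exponent 2 and factor 0 yields the integers 123, 450 and 10000, while a value such as 3.141592653589793 does not round-trip at that scale and is carried as an exception. DecodeVector restores the input bit for bit in both cases.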

Are these changes tested?

  • We have added unit tests for the new code.
  • Benchmarks have also been added that cover a wide variety of floating-point values, from low precision to high precision.

Are there any user-facing changes?

  • It's a new encoding, so the only impact is on query performance, which we claim will only improve.

github-actions bot commented Dec 5, 2025

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

@prtkgaur prtkgaur force-pushed the gh540-alp-pseudoDecimal-encoding branch from ed922b2 to 4c50497 Compare December 6, 2025 21:34
@prtkgaur prtkgaur force-pushed the gh540-alp-pseudoDecimal-encoding branch from 4c50497 to 48fd8fc Compare December 6, 2025 21:52
@prtkgaur prtkgaur force-pushed the gh540-alp-pseudoDecimal-encoding branch from 1b78a5c to d563ce0 Compare December 7, 2025 15:46
Contributor

I think the more standard place to put test data is in either arrow-testing or parquet-testing, so it can be used across implementations.

In this case I would recommend https://github.com/apache/parquet-testing

DELTA_BYTE_ARRAY = 7,
RLE_DICTIONARY = 8,
BYTE_STREAM_SPLIT = 9,
ALP = 10,
Contributor

🎉

Contributor

alamb commented Dec 8, 2025

Thanks @prtkgaur -- it is super exciting to see this movement.

Unfortunately, I am not familiar enough with the C/C++ codebase to give this a realistic review.

I started the CI checks on this PR and had some comments about the testing.
